mcelog: memory error handling in user space

Author

  • Andi Kleen
Abstract

Servers and high-performance computing systems contain more and more memory to handle larger data sets. But as memory modules grow in number and capacity, with more transistors in each, and as clusters of systems grow larger, the rate of memory errors in operation also increases. Modern server systems generally use ECC memory and other hardware mechanisms to detect and correct many memory errors. When the hardware corrects an error, it generates a corrected error event. Specialized software can also use these events to prevent future failures. mcelog is a daemon for handling and reporting hardware errors. It can use trends in corrected error reports to implement specific error prevention algorithms.
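The abstract does not spell out which prevention algorithms are meant, so the following is only an illustrative sketch of what "using trends in corrected error reports" could look like: a sliding-window counter that flags a memory page once it has accumulated several corrected errors in a short period. The class name `CorrectedErrorTracker`, the threshold, the window length, and the sample events are all assumptions made for this example; none of them are taken from mcelog itself.

```python
from collections import defaultdict, deque


class CorrectedErrorTracker:
    """Hypothetical sliding-window counter of corrected errors per memory page."""

    def __init__(self, threshold=3, window_seconds=3600.0):
        # Both values are illustrative assumptions, not mcelog defaults.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # page address -> event timestamps

    def record(self, page, timestamp):
        """Record one corrected error.

        Returns True when the page has reached the threshold within the
        window, i.e. when a preventive action might be worth considering.
        """
        events = self._events[page]
        events.append(timestamp)
        # Discard events that have fallen out of the time window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


# Made-up sample events: (page address, timestamp in seconds).
tracker = CorrectedErrorTracker(threshold=3, window_seconds=3600.0)
sample = [(0x1000, 0.0), (0x2000, 10.0), (0x1000, 20.0), (0x1000, 30.0)]
flagged = [page for page, ts in sample if tracker.record(page, ts)]
print([hex(p) for p in flagged])
```

In a real daemon, a flagged page would be handed to some preventive step (for example, asking the operating system to stop using it); that step is deliberately left out here because the abstract does not describe it.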

Similar resources

EHCtor: Detecting Resource-Release Omission Faults in Error-Handling Code for Systems Software

Adequate error-handling code is essential to the reliability of any system. On an error, such code is responsible for releasing acquired resources to restore the system to a viable state. Missing resource-release operations can lead to system crashes, memory leaks and deadlocks. A number of approaches have been proposed to detect such problems, but they mainly target frequently occurring resour...

Fault Tolerant Memory In Processor - SuperComputer On a Chip

Soft errors are adding another dimension to the present-day architecture design space. Different techniques, such as redundant multithreading, have evolved for handling them. The Memory In Processor (MIP) architecture provides fine-grained processor-memory integration. This integration provides efficient support for redundant multithreading within a functional unit. Detecting errors in intermediate sta...

DAChe: Direct Access Cache System for Parallel I/O

One of the largest challenges in client-side caching in extremely large-scale environments is consistency and coherency. By handling a user-space cache, we can offer applications much closer control over our client-side cache and scale the cache with the size of the compute resources (i.e. compute nodes). Cache data is shared among each compute node analogous to a traditional shared memory mach...

Flexible Memory Management in a Single Address Space Operating System Supporting Quality of Service

Recent developments in the area of memory management have focused on reducing the effects of the disk latency problem. These developments include the use of a compressed cache and the utilisation of memory on remote hosts. Further developments in the areas of Quality of Service (QoS) provision and user-level virtual memory have provided the impetus for a more specialised run-time environment th...

Mobile Robot Navigation Error Handling Using an Extended Kalman Filter

Navigation is one of the most complicated issues in mobile robots. Intelligent algorithms are often used for error handling in robot navigation. This paper deals with the problem of Inertial Measurement Unit (IMU) error handling by using an Extended Kalman Filter (EKF) as an expert algorithm. Our focus is on mobile robot navigation in 2D environments. The main chall...

Publication date: 2010